Dependencies

Summary:

1. Data Exploration/Preparation

Download datasets here: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

Observations:

  1. This data is at a song level
  2. There are many numerical values that I'll be able to use to compare songs (liveness, tempo, valence, etc.)
  3. Release date will be useful, but I'll need to create an OHE (one-hot encoded) variable for release date in 5-year increments
  4. Similar to 3, I'll need to create OHE variables for the popularity. I'll use 5-point increments here
  5. Nothing here describes the genre of the song, which would be useful. Without it, this data alone won't help us find relevant content, since this is a content-based recommendation system. Fortunately, there is a data_w_genres.csv file that should have some useful information

Observations:

  1. This data is at an artist level
  2. There are similar continuous variables to the ones in our initial dataset, but I won't use them. I'll just use the values in the previous dataset.
  3. The genres are going to be really useful here, and I'll need to use them moving forward. Now, the genre column appears to be in a list format, but my past experience tells me that it likely isn't. Let's investigate this further.

This checks whether or not genres is actually in a list format:
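As a minimal sketch of that check (the two-row CSV below is a made-up stand-in for data_w_genres.csv, and the `artists`/`genres` column names are assumptions), reading the file back with pandas and inspecting the type of one cell is enough:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for data_w_genres.csv.
csv_text = '''artists,genres
Queen,"['glam rock', 'rock']"
Unknown Artist,[]
'''

data_w_genre = pd.read_csv(io.StringIO(csv_text))

# Pull out a single cell and look at what pandas actually gave us.
first_genres = data_w_genre["genres"].iloc[0]
print(type(first_genres))  # the Python type of the cell
print(first_genres[0])     # first element: a whole genre for a list, one character for a string
```

Indexing into the cell is a quick tell: a real list would return the first genre, while a string returns just its first character.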

As we can see, it's actually a string that looks like a list. Based on the example above, I'm going to put together a regex to extract the genres and put them into a list.
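Here is one way that regex could look. This is a sketch rather than the exact expression used in the notebook: `parse_genres` is a hypothetical helper name, and it assumes every genre is wrapped in single quotes.

```python
import re


def parse_genres(raw):
    """Return the single-quoted items found in a list-like string."""
    # Capture everything between a pair of single quotes, non-greedily by
    # excluding the quote character itself from the captured text.
    return re.findall(r"'([^']*)'", raw)


print(parse_genres("['glam rock', 'rock']"))
print(parse_genres("[]"))
```

On a DataFrame this would typically be applied with something like `df["genres"].apply(parse_genres)` to build a new list-valued column.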

Voila, now we have the genre column in a format we can actually use. Further down, you'll see how we use it.

Now, if you recall, this data is at an artist level and the previous dataset is at a song level. So here's what we need to do:

  1. Explode the artists column in the previous dataset so each artist within a song will have their own row
  2. Merge data_w_genre to the exploded dataset from Step 1 so that the previous dataset is now enriched with the genre data

Before I go further, let's complete these two steps.

Step 1. Similar to before, we will need to extract the artists from the string list.

This looks good, but did it work for every artist string format? Let's double check.

So, it looks like it didn't catch all of them. You can quickly see why: artists with an apostrophe in their name are enclosed in double quotes instead of single quotes. I'll write another regex to handle this and then combine the two.
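A sketch of one way to combine the two cases is below. Instead of running two separate regexes and picking a result afterwards, it folds both into a single pattern with two alternatives, which is my own simplification and not necessarily how the notebook does it. `ARTIST_PATTERN` and `parse_artists` are hypothetical names.

```python
import re

# Two alternatives: a single-quoted name, or a double-quoted name
# (the form used when the name itself contains an apostrophe).
ARTIST_PATTERN = re.compile(r"'([^']*)'|\"([^\"]*)\"")


def parse_artists(raw):
    """Return artist names from a list-like string with mixed quoting."""
    # findall yields one (single, double) tuple per match; exactly one of
    # the two groups is filled in, so keep whichever is non-empty.
    return [single or double for single, double in ARTIST_PATTERN.findall(raw)]


print(parse_artists("['Queen', 'David Bowie']"))
print(parse_artists('["Guns N\' Roses", \'Queen\']'))
```

Because the pattern is scanned left to right, a double-quoted name is consumed whole before the apostrophe inside it can be mistaken for an opening single quote.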

Now I can explode this column and merge as I planned to in Step 2
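The explode and merge could be sketched like this. The tiny DataFrames and the column names (`artists_upd`, `genres_upd`) are assumptions made up for illustration; the real data would come from the two CSV files.

```python
import pandas as pd

# Hypothetical song-level data with the artists already parsed into lists.
songs = pd.DataFrame({
    "id": ["s1", "s2"],
    "name": ["Under Pressure", "Bohemian Rhapsody"],
    "artists_upd": [["Queen", "David Bowie"], ["Queen"]],
})

# Hypothetical artist-level data with the genres already parsed into lists.
artist_genres = pd.DataFrame({
    "artists": ["Queen", "David Bowie"],
    "genres_upd": [["glam rock", "rock"], ["art rock", "glam rock"]],
})

# Step 1: one row per (song, artist) pair.
exploded = songs.explode("artists_upd")

# Step 2: attach each artist's genres; a left join keeps songs whose
# artist has no match in the genre data (their genres will be NaN).
enriched = exploded.merge(
    artist_genres, how="left", left_on="artists_upd", right_on="artists"
)

print(enriched[["id", "artists_upd", "genres_upd"]])
```

A left join is used here so that no song is silently dropped; rows with missing genres can be filtered out later if needed.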

Alright, we're almost there. Now we need to:

  1. Group by the song id and essentially create lists of lists
  2. Consolidate these lists and output the unique values
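These two steps could look roughly like the sketch below. The sample rows and column names (`genres_upd`, `consolidated_genres`) are hypothetical, and dropping rows without genres first is my assumption about how unmatched artists should be handled.

```python
import itertools

import pandas as pd

# Hypothetical exploded-and-merged data: one row per (song, artist).
enriched = pd.DataFrame({
    "id": ["s1", "s1", "s2", "s3"],
    "genres_upd": [
        ["glam rock", "rock"],
        ["art rock", "glam rock"],
        ["rock"],
        None,  # artist with no genre information
    ],
})

# Assumption: rows without any genres are dropped before grouping.
enriched = enriched.dropna(subset=["genres_upd"])

# Step 1: collect each song's per-artist genre lists into a list of lists.
genre_lists = enriched.groupby("id")["genres_upd"].apply(list).reset_index()

# Step 2: flatten the nested lists and keep unique values in first-seen order.
genre_lists["consolidated_genres"] = genre_lists["genres_upd"].apply(
    lambda lists: list(dict.fromkeys(itertools.chain.from_iterable(lists)))
)

print(genre_lists[["id", "consolidated_genres"]])
```

`dict.fromkeys` is used instead of `set` so that the order of the genres is stable from run to run.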

2. Feature Engineering

- Normalize float variables

- OHE Year and Popularity Variables

- Create TF-IDF features off of artist genres
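The three steps above can be sketched on a toy DataFrame as follows. Everything here is illustrative: the sample rows, the column names, the min-max formula for normalization, and the trick of replacing spaces with underscores so that a multi-word genre stays a single TF-IDF token are all assumptions on my part.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical song-level data after the preparation steps.
songs = pd.DataFrame({
    "id": ["s1", "s2", "s3"],
    "year": [1991, 1994, 2003],
    "popularity": [12, 48, 71],
    "tempo": [90.0, 120.0, 150.0],
    "valence": [0.2, 0.5, 0.8],
    "consolidated_genres": [["glam rock", "rock"], ["rock"], ["dance pop"]],
})

# Normalize float variables to the [0, 1] range (min-max scaling).
# Note: a constant column would divide by zero and needs special handling.
float_cols = ["tempo", "valence"]
col_min = songs[float_cols].min()
col_range = songs[float_cols].max() - col_min
scaled = (songs[float_cols] - col_min) / col_range

# One-hot encode year in 5-year buckets and popularity in 5-point buckets.
year_ohe = pd.get_dummies((songs["year"] // 5) * 5, prefix="year")
popularity_ohe = pd.get_dummies(songs["popularity"] // 5, prefix="pop")

# TF-IDF over genres: each song becomes a "document" of genre tokens.
genre_docs = songs["consolidated_genres"].apply(
    lambda genres: " ".join(g.replace(" ", "_") for g in genres)
)
tfidf = TfidfVectorizer()
genre_matrix = tfidf.fit_transform(genre_docs)

# vocabulary_ maps each token to its column index in the matrix.
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
genre_df = pd.DataFrame(
    genre_matrix.toarray(), columns=["genre|" + term for term in terms]
)

# Final feature set: one row per song, all features side by side.
features = pd.concat(
    [songs[["id"]], scaled, year_ohe, popularity_ohe, genre_df], axis=1
)
print(features.columns.tolist())
```

Keeping the song `id` alongside the features makes it easy to look up the rows that belong to a playlist later on.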

3. Connect to Spotify API

Useful links:

  1. https://developer.spotify.com/dashboard/
  2. https://spotipy.readthedocs.io/en/2.16.1/

4. Create Playlist Vector

5. Generate Recommendations